We present a novel computational method for automatic assignment of protein
domains from structural data. At the core of our algorithm lies a recently
proposed clustering technique that has been very successful for
image-partitioning applications. This graph-theory based clustering method uses
the notion of a normalized cut to partition an undirected graph into its
strongly-connected components. A computer implementation of our method, tested
on the standard comparison set of proteins from the literature, shows a high
success rate (84%), better than most existing alternatives. In addition, several
other features of our algorithm, such as reliance on few adjustable parameters,
linear run-time with respect to the size of the protein and reduced complexity
compared to other graph-theory based algorithms, would make it an attractive
tool for structural biologists.


1 Introduction 

Understanding the biological functions of proteins is one of the major
challenges of the post-genomic era 1,2,3,4,5. The task would be vastly
simplified if the correlation between protein structures and sequences were
properly understood, because the three-dimensional (3D) shapes of proteins may
provide vital clues about their functions. An important lesson learned from
research on protein structures is that the number of evolutionarily distinct
proteins is finite. Families of evolutionarily related proteins share the same
folding architecture. It is estimated that the total number of such folding
architectures is of the order of thousands 1, which is much smaller than the
total protein space. Some of these proteins can be further decomposed into
domains, broadly defined as compact sub-structures in the 3D structure of a
protein 6,7,8. Some domains can carry out specific functions and also fold
autonomously 10. Due to such unique properties, domains are believed to serve
as more fundamental units of evolution and in many ways more basic blocks of
the protein universe 9. Therefore, cataloging domains will enrich the database
of protein domain families and eventually help in protein homology detection.
Moreover, recognizing the domain structures of a novel protein from its
sequence will also improve structural prediction by threading. Because of such
overwhelming importance of domains in structural classification as well as
functional understanding of proteins, several databases (PFAM 11, CATH 12,
SCOP 13, DALI 1,14) maintain lists of domains of all the proteins of the
Protein Data Bank (PDB) 15,16, which is the largest database of protein
structures.

Figure 1: Molecule 3CD4 and its W matrix. (a) T-Cell Surface Glycoprotein
(PDB code: 3CD4) contains two domains, as is clearly visible in the figure.
(b) W matrix of protein 3CD4.

As the PDB database keeps growing exponentially, keeping the domain databases
up-to-date turns out to be a challenging task. Initially, most of those
databases were maintained manually with help from human experts. Lack of
high-quality automated tools made the process slow and prone to errors.
Although some automated algorithms have been proposed in recent years
L?, 8,18,27,28,3^ they are still not very efficient 17. After conducting an
extensive comparison of all leading algorithms in 1998, Jones found that they
worked correctly for only 65-75 percent of cases 17. Moreover, the algorithms
in her study all agreed for only 55 percent of the test cases. Better
techniques were proposed in the following years 7,18, but the problem of
automated protein decomposition is far from solved.

The search for efficient domain decomposition techniques has been active since
the 1970s 19,20,21,22,23,24, but the early efforts were inconclusive due to
lack of analyzed protein structures. The methods tried in the early years
included analysis of Cα-Cα distance maps 20,21,22, minimum packing density of
Cα atoms 19, comparing interface area between two chains 24, estimating
maximum buried surface area 23, etc. Attempts were also made to predict
domains solely from the sequence data 25,26. The problem of domain
decomposition from structural data received renewed attention in recent years
due to exponential growth in the size of PDB over the last decade 15,16.
Sophisticated algorithms were suggested, some based on refinement of older
ideas and some others using completely new concepts. These recent algorithms
tried to obtain domains using inter-residue contacts 3, minimization of chain
fragmentation, search for presence of hydrophobic cores 28,29, inter-domain
dynamics 1,30, a dendrogram based on distance maps 27, multi-state
partitioning using an Ising chain type model 18 and a graph-theory based
network-flow algorithm 7.

In this paper, we use a powerful, normalized-cut based approach to partition
proteins into domains. It is a graph-theory based method that has been quite
successful for image-partitioning applications. In the biological field, Xing
et al. applied a similar technique for clustering DNA micro-array data 34, but
to the best of our knowledge, no one has applied it to the domain
decomposition problem yet. A computer implementation of our algorithm, tested
on the standard test set of 55 proteins from the literature 17, shows an 84%
rate of success, higher than most other existing algorithms 7,17. Among the 30
single-domain proteins in the test set, our method correctly identifies all
but two. For the 20 two-domain proteins, it achieves an 80 percent rate of
success. This success rate improves to 100 percent if the program knows
beforehand that there are only two domains in the protein and only has to
locate them. For multi-domain proteins, the success rate is poorer, but we
also note that the test set itself is very small: only 5 out of 55 proteins in
Jones' test set had more than two domains. Therefore the results may be
inconclusive in this case.


Our method has several other advantages. We find that even for proteins where
our method gives incorrect results, the cut value falls within a narrow range
out of all possible values. This can be very helpful for semi-automatic
identification of domains, where human assistance is needed only for proteins
whose cut values fall within the narrow range. In addition, our method needs
very few adjustable parameters and runs in linear time with respect to the
size of the protein. Also, the implementation is cleaner than other
graph-theory based domain-decomposition algorithms, because our method does
not need to add unphysical source and sink nodes 7.
Figure 2: Flow chart showing steps in the algorithm.

2 Methods 

In this section, we explain our algorithm for obtaining domains of a protein
from its structure (see Fig. [2]). For a given protein, we start with its
structure downloaded from the PDB database and represent it as a weighted,
undirected graph, where we consider the Cα atoms of the protein as vertices of
the graph. The edge weight w_ij between any pair of vertices (i, j) is
computed using the following equations:

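$$f_{ij} = \exp\left[-\left(\,\cdots\,\right)\right] + \beta_{ij} \tag{1}$$

$$w_{ij} = \begin{cases} f_{ij} & \text{if } f_{ij} \le 1 \\ 1 & \text{if } f_{ij} > 1 \end{cases} \tag{2}$$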
In Eqs. [1] and [2], d_ij is the distance between the Cα atoms of residues i
and j; d_0 and d are constant parameters. β_ij takes only two values, a
constant positive parameter β or 0, depending on whether or not residues i and
j belong to the same beta-sheet.

For the complete protein, the edge weights w_ij can be represented by a
symmetric matrix W. Physically, elements of the matrix W represent the
contacts between different residues of the protein. Larger values of w_ij,
close to 1, correspond to closely located residues i and j of the protein. On
the other hand, a small value of w_ij implies that the residues i and j are
far apart. The particular functional form for w_ij given by Eq. [2] turns out
to be unimportant for our partitioning method, and other functional forms can
be chosen as well. As long as the terms in W represent how strongly the amino
acids are connected, with a higher value for a stronger connection, our
partitioning algorithm would work. Therefore, the W matrix can be chosen to be
as simple as 1 (0) when the distance between the Cα atoms of two residues is
less than (greater than) a certain cutoff value 8, or it can be based on some
very sophisticated approach.
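
As a minimal sketch of the simplest choice mentioned above (a 0/1 contact matrix), the following Python builds W from a list of Cα coordinates. The function name `contact_matrix`, the 8 Å default cutoff and the toy coordinates are illustrative assumptions, not values taken from this paper.

```python
import math

def contact_matrix(ca_coords, cutoff=8.0):
    """Return a symmetric 0/1 matrix W with W[i][j] = 1 when the C-alpha
    atoms of residues i and j are closer than `cutoff` (in angstroms).

    The 8.0 default is a placeholder assumption, not a calibrated value.
    """
    n = len(ca_coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                w[i][j] = w[j][i] = 1.0  # keep W symmetric
    return w

# Toy "chain": two tight clusters of three residues, 20 angstroms apart.
coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0),
          (20, 0, 0), (23, 0, 0), (20, 3, 0)]
W = contact_matrix(coords)
```

In this toy example every pair inside a cluster is in contact and no pair across clusters is, so W is block-diagonal, which is the kind of structure the partitioning step looks for.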

Once we represent the protein as a graph as described above, our problem of
identification of domains translates into identifying the connected components
of the graph that are weakly connected to each other. Efficient
graph-theoretic algorithms exist to solve this problem. Among them, we choose
the one that partitions the graph using the notion of a normalized cut. It is
a clustering algorithm used routinely in the image-partitioning field 33. In
graph-theoretic language, our problem can be stated as follows: we seek to
partition the vertices V of a weighted, undirected graph G(V, W) into two
disjoint sets V_1 and V_2, where the contact is large between the vertices of
the same set and small between vertices of different sets. Intuitively, a good
partition (V_1, V_2) of the graph (V) should minimize the sum of the weights
for edges connecting the two subgroups. Mathematically, we need to minimize
the sum

$$\mathrm{cut}(V_1, V_2) = \sum_{i \in V_1} \sum_{j \in V_2} w_{ij} \tag{3}$$


to get the best partition. Unfortunately, such an approach is prone to result
in unbalanced partitions. We can check easily that a partition with only one
vertex in one subgroup (V_1) and the rest of the vertices in the other
subgroup (V_2) may give a very low sum in Eq. [3], but that is not the optimum
solution that we are looking for. Therefore the function in Eq. [3] needs to
be normalized to get the correct answer.

There are many possible ways to normalize the function in Eq. [3]. Shi and
Malik studied many such alternatives 33 and found that minimization of the
function

$$\mathrm{Ncut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_1, V)} + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_2, V)} \tag{4}$$

gives the optimum partition of the graph. In Eq. [4], assoc(V_m, V) is defined
as

$$\mathrm{assoc}(V_m, V) = \sum_{i \in V_m} \sum_{j \in V} w_{ij}. \tag{5}$$

It can be shown that Ncut has a maximum range of (0, 2).
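
The difference between the plain cut of Eq. [3] and the normalized cut of Eq. [4] can be checked numerically. The sketch below is a direct, unoptimized transcription of Eqs. [3]-[5] in Python; the helper `build_graph` and the seven-vertex toy graph with its weights are made up for illustration.

```python
def cut(w, a, b):
    """Sum of edge weights between vertex sets a and b (Eq. 3)."""
    return sum(w[i][j] for i in a for j in b)

def assoc(w, a):
    """Total weight from the vertices in a to all vertices (Eq. 5)."""
    return sum(w[i][j] for i in a for j in range(len(w)))

def ncut(w, a, b):
    """Normalized cut of the partition (a, b) (Eq. 4)."""
    c = cut(w, a, b)
    return c / assoc(w, a) + c / assoc(w, b)

def build_graph(n, edges):
    """Hypothetical helper: symmetric weight matrix from (i, j, weight)."""
    w = [[0.0] * n for _ in range(n)]
    for i, j, weight in edges:
        w[i][j] = w[j][i] = weight
    return w

# Two triangles joined by a bridge (weight 0.6), plus a loosely attached
# vertex 6 (weight 0.5). All numbers are illustrative.
W = build_graph(7, [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                    (2, 3, 0.6), (5, 6, 0.5)])

balanced = ([0, 1, 2], [3, 4, 5, 6])
lopsided = ([6], [0, 1, 2, 3, 4, 5])
```

Here the lopsided split that isolates vertex 6 has the smaller plain cut (0.5 versus 0.6), but its Ncut is above 1, while the balanced split has an Ncut of about 0.17. This is the behaviour the normalization is meant to produce.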

To find the best partition of the vertices of the graph, one obvious approach
would be to go over all possible partitions of the set V, compute Ncut for
each partition using Eq. [4] and then pick the partition that gives the lowest
Ncut. However, minimizing Ncut exactly is an NP-complete problem 33 and
therefore not practically feasible for large proteins. Fortunately, it is
possible to get an approximate solution for the partitioning problem in the
following way. We consider the generalized eigenvalue problem:

$$(D - W)\,v = \lambda D v, \tag{6}$$

where W is the edge-weight matrix of the graph and D is a diagonal matrix
whose i-th diagonal element is the sum of the i-th row of W. After solving the
eigenvalue problem, we obtain the eigenvector corresponding to the second
smallest eigenvalue of Eq. [6]. It is proved 33 that an approximate solution
for the optimum partition of the graph is given by considering the positive
and negative elements of the chosen eigenvector as two subsets. Moreover, the
eigenvector for the second smallest eigenvalue of Eq. [6] can be computed in
O(N) time, making such an approach computationally very efficient.
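
The eigenvector step can be sketched in a few lines of dependency-free Python. Note that this sketch swaps in plain power iteration with deflation instead of the Lanczos method used in our implementation, and it is dense, so it does not have the O(N) cost discussed in the text. It relies on the standard substitution z = D^{1/2} v, which turns Eq. [6] into an ordinary symmetric eigenproblem for M = D^{-1/2} W D^{-1/2}; the second smallest eigenvalue of Eq. [6] corresponds to the second largest eigenvalue of M. The function name, iteration count, start vector and toy graph are all assumptions.

```python
import math

def fiedler_partition(w, iters=1000):
    """Approximate the eigenvector of (D - W) v = lambda D v for the second
    smallest eigenvalue and split vertices by sign.
    Assumes every vertex has positive total weight (no isolated vertices)."""
    n = len(w)
    s = [math.sqrt(sum(row)) for row in w]          # diagonal of D^{1/2}
    # B = (M + I) / 2 has the same eigenvectors as M, eigenvalues in [0, 1].
    b = [[0.5 * (w[i][j] / (s[i] * s[j]) + (1.0 if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    norm0 = math.sqrt(sum(x * x for x in s))
    z0 = [x / norm0 for x in s]                     # trivial top eigenvector
    z = [math.sin(i + 1.0) for i in range(n)]       # arbitrary start vector
    for _ in range(iters):
        proj = sum(x * y for x, y in zip(z, z0))
        z = [x - proj * y for x, y in zip(z, z0)]   # remove trivial component
        z = [sum(b[i][j] * z[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in z))
        z = [x / norm for x in z]
    v = [z[i] / s[i] for i in range(n)]             # back to v = D^{-1/2} z
    positive = [i for i in range(n) if v[i] > 0]
    negative = [i for i in range(n) if v[i] <= 0]
    return v, positive, negative

# Toy graph: two triangles joined by one weak edge (weight 0.1).
W = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0.1, 0, 0],
     [0, 0, 0.1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
v, positive, negative = fiedler_partition(W)
```

For this toy graph the sign pattern of v separates the two triangles; which triangle ends up on the positive side is arbitrary, since an eigenvector is only defined up to sign.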

The mathematical derivation of the eigenvector-based procedure is given in
Ref. [33]. Instead of reproducing it here, we attempt to physically justify it
by showing an analogy to another problem. In quantum mechanics, we describe
the eigenstates of the hydrogen atom in terms of s, p and d orbitals, all of
which have different shapes. The s-state is spherical with no nodes, and the
p-state is dumbbell-shaped with one node. On the other hand, the s-state is
the lowest energy state and the p-state has the second smallest eigen-energy.
Since energy is an eigenvalue of Schrodinger's equation, we see that the
eigenvector corresponding to the second smallest eigenvalue attempts to
partition space into two distinct lobes. Our procedure here is analogous.

Once we identify the optimum partition of the protein using the
above-mentioned procedure, we need to decide whether to accept the cut or
reject it. There are many single-domained proteins which need not be split
into two domains. The decision is made by comparing the obtained smallest Ncut
value with a predetermined parameter cutoff. If Ncut is higher than cutoff,
the cut is rejected and the protein is concluded to be single-domained. On the
other hand, if Ncut is lower than cutoff, the cut is accepted. Each segment is
considered to be a domain of the protein and subsequently checked for the
possibility of further subdivisions by reapplying the algorithm on the
individual segments.

The algorithm described above is generally applicable to any partitioning
problem. For the specific case of protein domain decomposition, some
additional modifications need to be made. Following Richardson's broad
guidelines 6, reflecting a commonly accepted viewpoint on the definition of
domains, we do not accept protein segments less than 40 residues long as
domains and therefore do not attempt to cut proteins less than 80 residues
long. Moreover, to avoid fragmentation, if our algorithm predicts parts of
domains which are less than 20 residues long, we insert such short segments
back into the other part of the chain as a post-processing step.
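
The accept/reject rule and the recursive subdivision can be sketched as follows. This is a simplified illustration, not our implementation: `bipartition` is a hypothetical callable standing in for the eigenvector step, a cut with Ncut exactly equal to the cutoff is accepted (the text leaves that case open), and the post-processing of fragments shorter than 20 residues is omitted.

```python
def decompose(residues, bipartition, cutoff=0.26, min_domain=40):
    """Recursively split `residues` (a list of residue numbers) into domains.

    `bipartition(residues)` is assumed to return (ncut_value, part_a, part_b)
    for the best two-way split of the given residues.
    """
    if len(residues) < 2 * min_domain:       # chains under 80 residues: no cut
        return [residues]
    ncut_value, part_a, part_b = bipartition(residues)
    if ncut_value > cutoff:                  # cut rejected: single domain
        return [residues]
    if min(len(part_a), len(part_b)) < min_domain:
        return [residues]                    # a part is too short to be a domain
    return (decompose(part_a, bipartition, cutoff, min_domain)
            + decompose(part_b, bipartition, cutoff, min_domain))

def toy_bipartition(residues):
    """Stand-in for the spectral step, returning made-up numbers."""
    if len(residues) == 178:
        return 0.17, residues[:97], residues[97:]
    half = len(residues) // 2
    return 0.5, residues[:half], residues[half:]

domains = decompose(list(range(1, 179)), toy_bipartition)
```

With the stand-in above, a 178-residue chain is split once into residues 1-97 and 98-178, and both halves are then kept whole because their made-up Ncut of 0.5 exceeds the cutoff.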

Additional steps are taken to make the algorithm numerically efficient.
Firstly, we truncate small terms of W to zero. When the distance between two
residues is above 25 Å, w_ij is truncated to zero. This gives us a sparse W
matrix, improving speed without sacrificing accuracy. For such a sparse W, the
eigenvalues and eigenvectors in Eq. [6] can be computed in O(N) steps using
the Lanczos algorithm, an approximate, recursive method. In comparison, the
normal time for calculation of eigenvalues is O(N^3). Moreover, to improve the
approximate partitioning algorithm, we partition the graph not solely on
values of the eigenvector above and below zero, but also try a few more cutoff
points near zero to see whether the Ncut value improves. Such a procedure was
originally suggested in Ref. [33] and we find it to improve the quality of our
results.
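
The search over splitting points near zero can be sketched as below. The candidate thresholds, the toy weight matrix and the eigenvector values are all made up for illustration; the text does not specify which splitting points are tried.

```python
def ncut(w, a, b):
    """Normalized cut of the partition (a, b), as in Eq. 4."""
    n = len(w)
    c = sum(w[i][j] for i in a for j in b)
    assoc_a = sum(w[i][j] for i in a for j in range(n))
    assoc_b = sum(w[i][j] for i in b for j in range(n))
    return c / assoc_a + c / assoc_b

def best_threshold_split(w, v, thresholds=(-0.1, -0.05, 0.0, 0.05, 0.1)):
    """Try several splitting points on the eigenvector v and return
    (ncut_value, threshold, part_a, part_b) for the lowest Ncut found."""
    best = None
    for t in thresholds:
        a = [i for i, x in enumerate(v) if x > t]
        b = [i for i, x in enumerate(v) if x <= t]
        if not a or not b:                   # skip degenerate splits
            continue
        value = ncut(w, a, b)
        if best is None or value < best[0]:
            best = (value, t, a, b)
    return best

# Toy graph: triangle {0, 1, 2} and pair {3, 4}, joined by a weak edge 2-3.
W = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 0.2, 0],
     [0, 0, 0.2, 0, 1],
     [0, 0, 0, 1, 0]]
v = [0.5, 0.4, -0.02, -0.4, -0.5]            # hypothetical eigenvector values
value, threshold, part_a, part_b = best_threshold_split(W, v)
```

In this example vertex 2 sits just below zero, so a split at zero would separate it from its triangle; moving the splitting point slightly below zero keeps the triangle together and gives a much lower Ncut.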

3 Estimation of Model Parameters 

Four important parameters (d, d_0, β and cutoff) need to be estimated for
optimum performance of our algorithm. For calibration of these parameters, we
choose a method similar to Ref. f]. We consider a calibration set of 206
proteins 35 for which domains are already known and available in the
literature 8. Of these, 47 are two-domained proteins and the remaining 159 are
single-domained. We apply our algorithm to all these proteins with different
values of the parameters d, d_0, β and cutoff and compare the results with the
known solutions. The values of the parameters for which the domains of most
proteins are correctly identified are chosen as standard parameters. Based on
this analysis, we found that d = 10.33 Å, d_0 = 2, β = 0.01 and cutoff = 0.26
give the best results.
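
The calibration amounts to an exhaustive search over parameter combinations. The sketch below shows the idea under strong simplifications: `predict` is a hypothetical callable wrapping the whole algorithm, correctness is judged only by the number of domains rather than by their boundaries, and the toy data (one made-up Ncut value per protein) replace the real calibration set.

```python
from itertools import product

def calibrate(proteins, predict, grid):
    """Return (best_params, n_correct) over all parameter combinations.

    proteins: list of (structure, known_domain_count) pairs.
    predict(structure, **params): assumed to return a predicted domain count.
    grid: dict mapping a parameter name to its candidate values.
    """
    names = list(grid)
    best_params, best_score = None, -1
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = sum(1 for structure, known in proteins
                    if predict(structure, **params) == known)
        if score > best_score:               # keep the first best combination
            best_params, best_score = params, score
    return best_params, best_score

def toy_predict(min_ncut, cutoff):
    """Toy predictor: two domains when the Ncut value is below the cutoff."""
    return 2 if min_ncut < cutoff else 1

toy_proteins = [(0.17, 2), (0.22, 2), (0.35, 1), (0.29, 1)]
best_params, n_correct = calibrate(toy_proteins, toy_predict,
                                   {"cutoff": [0.1, 0.2, 0.26, 0.4]})
```

In a fuller version the grid would also include d, d_0 and β, and `predict` would rebuild W and rerun the partitioning for every combination.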

With the chosen form of the W matrix and associated parameters, we discuss
here the implications of varying the parameters. We find that if d is changed
to larger values, w_ij becomes 1 for many residue pairs. Therefore, the
resolution of the method goes down and Ncut gets larger. Resolution improves
for smaller d. However, if d is too small, all off-diagonal elements become
very small, so that is not acceptable either. Some intermediate value of d is
more desirable, and the method optimizes for d = 10.33 Å. On the other hand,
d_0 is a scaling parameter. It effectively scales up the high-resolution
numbers that were obtained with low d.


4 Results and Discussion 

We illustrate the general procedure discussed in section [2] with the example
of T-Cell Surface Glycoprotein (PDB code: 3CD4). This protein has two domains
clearly identifiable in Fig. [1(a)]. We represent the protein as a graph and
compute the adjacency matrix W. The W matrix is shown in Fig. [1(b)] as an
image plot. Based on the eigenvector corresponding to the second smallest
eigenvalue in Eq. [6], we partition the protein into two groups of residues,
(1-97) and (98-178). The value of Ncut for this partition is 0.17, which is
much smaller than the cutoff parameter 0.26. Therefore the cut is accepted and
the two segments are chosen as two domains of the protein. Attempts at further
partitioning of the individual segments produce Ncut values larger than 0.26
and therefore such subdivisions are rejected. Hence, our algorithm gives two
domains for the protein, closely matching expert opinion from the literature
[(1-98) and (99-178)].

To compare the overall quality of our method with other methods in the
literature, we test it on the standard set of 55 proteins from the
literature 17. The set includes 30 single-domain proteins a, 20 two-domain
proteins b, two three-domain proteins c and three four-domain proteins d.
Domains for all these proteins are available from the literature 8. Structures
of some of the proteins in the PDB database have been corrected and saved
under new names in PDB since the original paper came out in 1998. Therefore,
we replaced 1aak by 2aak, 1ace by 2ace, 1gmfA by 2gmfA, 1wsy by 1bks, 1fnr by
1fnb, 2pmg by 3pmg and 3gap by 1g6n to reflect this change.

a one-domain: 2aai, 2ace, 1bbhA, 1bbpA, 1brd, 1fxiA, 1gky, 2gmfA, 1gmpA, 1gox,
1ofv, 1pyp, 1rbp, 1rcb, IneA, 1snc, 1tie, 1tlk, 1ula, 1bksA, 2azaA, 2ccyA,
2rn2, 2stv, 2tmvA, 3chy, 3cla, 3dfrA, 4blmA, 5p21
b two-domain: 1ezm, 1fnb, 1gpb, 1lap, 1pfkA, 1ppn, 1rhd, 1sgt, 1vsg, 1bksB,
2cyp, 2had, 3cd4, 1g6nA, 3pgk, 4gcr, 5fbp, 8adh, 8atcA, 8atcB
c three-domain: 1phh, 3grs
d four-domain: 1atnA, 3pmgA, 8acn

Table 1: Domain decomposition of the 25 multi-domain proteins of the standard
comparison set. Column 'Expert Opinion in Lit.' shows the domains identified
by experts in the literature. Column 'Normalized Cut' shows results from our
algorithm. Of the remaining 30 proteins, which are single-domained, two
proteins (2ACE and 2TMVP) are identified incorrectly as two-domained by our
algorithm.

| Protein | Expert Opinion in Lit. | Normalized Cut | Accuracy |
|---|---|---|---|
| 2 domains: | | | |
| 1EZM | (1-134), (135-298) | (1-146,170-197), (147-169,198-298) | 87% |
| 1FNB | (19-11 l), (162-314) | (19-153), (154-314) | 98% |
| 1GPB | (19-41-!)), (490-841) | 3 domains | wrong |
| 1LAP | (1-150), (171-484) | (1-158), (159-484) | 100% |
| 1PFKA | (0-138,251-301), (139-250,302-319) | (0-142,255-319), (143-251) | 98% |
| 1PPN | (1-10,112-208), (21-111,209-212) | (1-212) | wrong |
| 1RHD | (1-158), (159-293) | (1-157), (158-293) | 100% |
| 1SGT | (22-123,234-245), (129-233) | (16-245) | wrong |
| 1VSGA | (1-29,92-251), (42-75,266-362) | (1-33,86-255), (34-85,256-362) | 100% |
| 1BKSB | (9-52,86-204), (53-85,205-393) | (3-53,87-204), (54-86,205-394) | 100% |
| 2CYP | (3-145,266-294), (164-265) | (2-144,266-294), (145-265) | 99% |
| 2HAD | (1-155,233-310), (156-229) | (1-310) | wrong |
| 3CD4 | (1-98), (99-178) | (1-97), (98-178) | 100% |
| 1G6NA | (1-129), (139-208) | (7-128), (129-204) | 100% |
| 3PGK | (1-185,403-415), (200-392) | (1-195,390-415), (196-389) | 100% |
| 4GCR | (1-83), (84-174) | (1-81), (82-173) | 99% |
| 5FBP | (6-201), (202-335) | (6-199), (200-334) | 100% |
| 8ADH | (1-170,319-374), (176-318) | (1-177,318-374), (178-317) | 99% |
| 8ATCA | (1-137,288-310), (144-283) | (1-140,289-309), (141-288) | 100% |
| 8ATCB | (8-97), (101-152) | (8-98), (99-153) | 100% |
| 3 domains: | | | |
| 1PHH | (1-155), (176-290), (291-394) | (1-69,100-180,269-341), (70-99,181-268,342-394) | wrong |
| 3GRS | (18-157,294-364), (158-293), (365-478) | (18-60,108-159,291-364), (61-107,160-220,242-290), (221-241,365-478) | 87% |
| 4 domains: | | | |
| 1ATNA | (1-32,70-144,338-372), (33-69), (145-180,270-337), (181-269) | (1-33,69-136,339-372), (34-68,186-257), (137-185,258-338) | 95% |
| 3PMGA | (1-188), (192-315), (325-403), (408-561) | (1-193), (194-419), (420-561) | wrong |
| 8ACN | (2-200), (201-317), (320-513), (538-754) | (2-67,151-208,231-312), (68-150,510-530), (209-230,313-508), (531-754) | wrong |

The results of our comparison are shown in Table [1]. Using the definition of
correctness of Islam 8,7, our method gives 84 percent accuracy. We have nine
incorrect results: the proteins with PDB codes 2ace, 2tmvP, 1gpb, 1ppn, 1sgt,
2had, 1phh, 3pmgA and 8acn.

By analyzing the Ncut values for the proteins on which our algorithm failed,
we note that they fall within a much smaller subrange of the maximum range
(0, 2). This allows us to suggest a better method for protein domain
decomposition. For proteins whose cut value lies outside this subrange, the
automatic decision to accept or reject the cut can be trusted. On the other
hand, for proteins whose cut value lies within the small 'gray area' subrange,
an expert can be consulted to accept or reject the cut. The whole range (0, 2)
of minimum Ncut values can be divided into (i) sure cut (0 - 0.20), (ii) gray
region (0.20 - 0.30) and (iii) sure not a domain (0.31 - 2). This will greatly
reduce the burden on human experts maintaining the PDB database, because they
can concentrate on a smaller group of proteins rather than all the proteins.
We note that using this strategy, we can correctly identify all the proteins
in Jones' list except three. These three are the proteins with PDB codes 1sgt,
2had and 1atn, some of which have a history of being difficult to identify
even for human experts 18.
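
The three-way division of Ncut values can be written as a small triage function. This is a sketch only: the labels and the function name are made up, and the treatment of values exactly on a boundary (and of the gap between 0.30 and 0.31 in the ranges above) is an assumption.

```python
def triage(ncut_value, sure_cut=0.20, sure_single=0.30):
    """Classify a minimum Ncut value into one of the three regions."""
    if ncut_value < sure_cut:
        return "accept cut"        # sure cut: split into domains
    if ncut_value > sure_single:
        return "reject cut"        # sure not a domain boundary
    return "ask expert"            # gray region: needs human judgment

labels = [triage(x) for x in (0.17, 0.26, 0.50)]
```

Only the proteins labelled "ask expert" would need to be passed to a human curator.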


5 Conclusion 


In this paper, we use a powerful, normalized-cut based approach to partition
proteins into domains. This graph-theory based approach, borrowed from the
image-partitioning field, shows a good rate of success when applied to the
protein domain decomposition problem. A computer implementation of our
algorithm, tested on a commonly used standard test set of 55 proteins 17,
obtains an 84 percent rate of success, higher than most other existing
algorithms 17,7. Also, this method needs very few adjustable parameters and
runs in linear time with respect to the size of the protein. Moreover, we find
that even for proteins where our method gives incorrect results, the cut value
falls within a narrow range out of all possible values. Therefore, our method
will be useful for automatic domain databases as well as semi-automatic ones
where the aid of a human expert is taken for difficult proteins.